PROJECT ON DIABETES PREDICTION¶

The objective of this project is to predict whether or not a patient has diabetes, based on the diagnostic measurements included in the dataset.

Importing libraries: Pandas, NumPy, Matplotlib, Seaborn and scikit-learn (sklearn)¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix,classification_report

The objective of this diabetes dataset is to predict whether a patient has diabetes or not. The dataset consists of several medical predictors (independent variables) and one target variable (Outcome). The predictor variables are Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction and Age.

In [2]:
df = pd.read_csv(r'C:\sample\diabetes.csv')  # raw string so backslashes are not treated as escapes
df
df
Out[2]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
... ... ... ... ... ... ... ... ... ...
763 10 101 76 48 180 32.9 0.171 63 0
764 2 122 70 27 0 36.8 0.340 27 0
765 5 121 72 23 112 26.2 0.245 30 0
766 1 126 60 0 0 30.1 0.349 47 1
767 1 93 70 31 0 30.4 0.315 23 0

768 rows × 9 columns

Exploratory Data Analysis¶

Exploratory Data Analysis (EDA) is a step in the Data Analysis Process, where a number of techniques are used to better understand the dataset being used.

The head() method returns the first rows of the DataFrame (5 by default).¶

In [3]:
df.head()
Out[3]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

The tail() method returns the last rows of the DataFrame (5 by default).¶

In [4]:
df.tail()
Out[4]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
763 10 101 76 48 180 32.9 0.171 63 0
764 2 122 70 27 0 36.8 0.340 27 0
765 5 121 72 23 112 26.2 0.245 30 0
766 1 126 60 0 0 30.1 0.349 47 1
767 1 93 70 31 0 30.4 0.315 23 0

The sample() method returns a random sample of rows from the DataFrame.¶

In [5]:
df.sample(7)
Out[5]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
330 8 118 72 19 0 23.1 1.476 46 0
223 7 142 60 33 190 28.8 0.687 61 0
566 1 99 72 30 18 38.6 0.412 21 0
136 0 100 70 26 50 30.8 0.597 21 0
307 0 137 68 14 148 24.8 0.143 21 0
401 6 137 61 0 0 24.2 0.151 55 0
55 1 73 50 10 0 23.0 0.248 21 0

The shape attribute gives the number of rows and columns of the dataset¶

In [6]:
df.shape
Out[6]:
(768, 9)

Number of rows = 768

Number of columns = 9

The info() method shows a summary of the data and the datatype of each attribute¶

In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

The describe() method computes summary statistics such as count, mean, std and percentiles for the numerical columns of the Series or DataFrame¶

In [8]:
df.describe()
Out[8]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000

In the above table, the min value of the columns 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin' and 'BMI' is zero (0). These values cannot physically be zero, so they are treated as missing and imputed with the mean of each column.
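The replacement can be sketched in a single pass on a hypothetical mini-frame. Note that marking zeros as NaN first means the fill value is the mean of the non-zero entries only, a slightly stricter variant than the cell-by-cell replacement used below, which uses the raw column mean (zeros included):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for the diabetes data
df = pd.DataFrame({'Glucose': [148, 0, 183], 'BMI': [33.6, 0.0, 23.3]})

cols = ['Glucose', 'BMI']
# Mark zeros as missing, then fill each column with its mean
# (computed over the non-zero entries only)
df[cols] = df[cols].replace(0, np.nan)
df[cols] = df[cols].fillna(df[cols].mean())
print(df['Glucose'].tolist())  # [148.0, 165.5, 183.0]
```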

Data Cleaning¶

The isnull() method is used to check for NULL values in the DataFrame.¶

In [9]:
df.isnull()
Out[9]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 False False False False False False False False False
1 False False False False False False False False False
2 False False False False False False False False False
3 False False False False False False False False False
4 False False False False False False False False False
... ... ... ... ... ... ... ... ... ...
763 False False False False False False False False False
764 False False False False False False False False False
765 False False False False False False False False False
766 False False False False False False False False False
767 False False False False False False False False False

768 rows × 9 columns

df.isnull().sum().sum() returns the total number of missing values in the dataset.¶

In [10]:
df.isnull().sum()
Out[10]:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
In [11]:
df.isnull().sum().sum()
Out[11]:
0

There are no NULL values in the given dataset.

In [12]:
df.columns
Out[12]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

Check the number of ZERO values in the dataset¶

In [13]:
print('No of zero values in Glucose',df[df['Glucose']==0].shape[0])
No of zero values in Glucose 5
In [14]:
print('No of zero values in BloodPressure',df[df['BloodPressure']==0].shape[0])
No of zero values in BloodPressure 35
In [15]:
print('No of zero values in SkinThickness',df[df['SkinThickness']==0].shape[0])
No of zero values in SkinThickness 227
In [16]:
print('No of zero values in Insulin',df[df['Insulin']==0].shape[0])
No of zero values in Insulin 374
In [17]:
print('No of zero values in BMI',df[df['BMI']==0].shape[0])
No of zero values in BMI 11
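The five counting cells above can also be collapsed into a single vectorized count; a small sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical mini-frame with zeros standing in for missing values
df = pd.DataFrame({'Glucose': [148, 0, 183], 'Insulin': [0, 0, 94]})

zero_counts = (df == 0).sum()   # Series: number of zeros per column
print(zero_counts['Insulin'])   # 2
```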

Replace the ZERO values with the mean of each column¶

In [18]:
df['Glucose']=df['Glucose'].replace(0,df['Glucose'].mean())
print('No.of zero value in Glucose',df[df['Glucose']==0].shape[0])
No.of zero value in Glucose 0
In [19]:
df['BloodPressure']=df['BloodPressure'].replace(0,df['BloodPressure'].mean())
print('No.of zero value in BloodPressure',df[df['BloodPressure']==0].shape[0])
No.of zero value in BloodPressure 0
In [20]:
df['SkinThickness']=df['SkinThickness'].replace(0,df['SkinThickness'].mean())
print('No.of zero value in SkinThickness',df[df['SkinThickness']==0].shape[0])
No.of zero value in SkinThickness 0
In [21]:
df['Insulin']=df['Insulin'].replace(0,df['Insulin'].mean())
print('No.of zero value in Insulin',df[df['Insulin']==0].shape[0])
No.of zero value in Insulin 0
In [22]:
df['BMI']=df['BMI'].replace(0,df['BMI'].mean())
print('No.of zero value in BMI',df[df['BMI']==0].shape[0])
No.of zero value in BMI 0
In [23]:
df.describe()
Out[23]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 121.681605 72.254807 26.606479 118.660163 32.450805 0.471876 33.240885 0.348958
std 3.369578 30.436016 12.115932 9.631241 93.080358 6.875374 0.331329 11.760232 0.476951
min 0.000000 44.000000 24.000000 7.000000 14.000000 18.200000 0.078000 21.000000 0.000000
25% 1.000000 99.750000 64.000000 20.536458 79.799479 27.500000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 79.799479 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000

Data Visualization¶

Visualize the location of missing values¶

seaborn's heatmap() is used here to visualize where values are missing in each variable.

In [24]:
plt.figure(figsize=(25,25))
sns.heatmap(df.isnull())
Out[24]:
<Axes: >

count plot¶

The countplot() method shows the counts of observations in each categorical bin using bars.¶

A Pie Chart is a circular statistical plot that can display only one series of data.¶

In [25]:
f,ax=plt.subplots(1,2,figsize=(10,5))
df['Outcome'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Outcome')
ax[0].set_ylabel('')
sns.countplot(x=df.Outcome)
plt.title("Count Plot for Outcome")
N,P = df['Outcome'].value_counts()
print('Negative(0):',N)
print('Positive(1):',P)
plt.grid()
plt.show()
Negative(0): 500
Positive(1): 268

Out of the total 768 people, 268 are diabetic (positive, 1) and 500 are non-diabetic (negative, 0).

In the Outcome column, 1 represents diabetes positive and 0 represents diabetes negative.

The count plot shows that the dataset is imbalanced: the number of patients without diabetes is greater than the number with diabetes.
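Using the counts printed above, the class ratio can be quantified directly:

```python
import pandas as pd

# Outcome counts reported above: 500 negative, 268 positive
y = pd.Series([0] * 500 + [1] * 268)
ratio = y.value_counts(normalize=True)
print(round(ratio[0], 3), round(ratio[1], 3))  # 0.651 0.349
```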

Histogram¶

Histograms are one of the most common graphs used to display numeric data.

They show the distribution of the data: whether it is normally distributed or skewed to the left or right.

In [26]:
df.hist(bins=10, figsize=(10,10))
plt.show()

Scatter Plot¶

A scatter plot is a diagram where each value in the data set is represented by a dot.

In [27]:
from pandas.plotting import scatter_matrix
scatter_matrix(df,figsize=(20,20));

Pair plot¶

Pairplot allows us to plot pairwise relationships between variables within a dataset.

In [28]:
sns.pairplot(data=df,hue='Outcome')
plt.show()

Correlation heatmap¶

Correlation analysis quantifies the degree to which two variables are related. The correlation coefficient tells how much one variable changes when the other does, i.e. it measures the strength of the linear relationship between the two variables. Correlating the feature variables with the target variable shows how strongly each feature is related to the target.

In [29]:
corrmat = df.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(10,10))
g=sns.heatmap(df[top_corr_features].corr(),annot=True)

Observations: from the correlation heatmap, we can see that Outcome is most correlated with Pregnancies, Glucose, BMI, Age and Insulin. We can select these features to accept input from the user and predict the outcome.
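Ranking features by their absolute correlation with the target can be done programmatically; a self-contained sketch on synthetic data (column names reused for illustration only):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: Glucose drives Outcome, Age is pure noise
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=['Glucose', 'Age'])
df['Outcome'] = (df['Glucose'] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Rank features by absolute correlation with the target
corr = df.corr()['Outcome'].drop('Outcome').abs().sort_values(ascending=False)
top_features = corr.index.tolist()
print(top_features[0])  # Glucose
```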

Split the data frame into X & y¶

In [30]:
# Class labels for plots and reports; index 0 = non-diabetic, 1 = diabetic.
# (A dict like {'yes': 1, 'No': 0} would iterate as ['yes', 'No'] and label the classes in reverse.)
classes = ['No Diabetes', 'Diabetes']
In [32]:
X = df.drop('Outcome', axis=1)  # independent variables (features)
y = df['Outcome']               # dependent variable (target)

print('Shape of  X =', X.shape)
print('Shape of y = ', y.shape)
Shape of  X = (768, 8)
Shape of y =  (768,)
In [33]:
X
Out[33]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 6 148.0 72.0 35.000000 79.799479 33.6 0.627 50
1 1 85.0 66.0 29.000000 79.799479 26.6 0.351 31
2 8 183.0 64.0 20.536458 79.799479 23.3 0.672 32
3 1 89.0 66.0 23.000000 94.000000 28.1 0.167 21
4 0 137.0 40.0 35.000000 168.000000 43.1 2.288 33
... ... ... ... ... ... ... ... ...
763 10 101.0 76.0 48.000000 180.000000 32.9 0.171 63
764 2 122.0 70.0 27.000000 79.799479 36.8 0.340 27
765 5 121.0 72.0 23.000000 112.000000 26.2 0.245 30
766 1 126.0 60.0 20.536458 79.799479 30.1 0.349 47
767 1 93.0 70.0 31.000000 79.799479 30.4 0.315 23

768 rows × 8 columns

In [34]:
y
Out[34]:
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

Splitting the dataset into training and testing sets¶

In [60]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=7)
print('Shape of X_train=', X_train.shape)
print('Shape of y_train=', y_train.shape)
print('Shape of X_test=', X_test.shape)
print('Shape of y_test=', y_test.shape)
Shape of X_train= (614, 8)
Shape of y_train= (614,)
Shape of X_test= (154, 8)
Shape of y_test= (154,)

Apply Feature Scaling¶

In [61]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
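The key design choice above is that the scaler is fitted on the training split only and merely applied to the test split, so no test-set statistics leak into training. A minimal illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [3.0]])   # mean = 2, std = 1
X_test = np.array([[2.0]])

scaler = StandardScaler().fit(X_train)   # statistics come from the training split only
print(scaler.transform(X_test))          # [[0.]] : (2 - 2) / 1
```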

Classification algorithms¶

Random Forest¶

In [62]:
from sklearn.ensemble import RandomForestClassifier
In [63]:
classifier_rf=RandomForestClassifier(n_estimators=100,criterion='gini')
classifier_rf.fit(X_train,y_train)
Out[63]:
RandomForestClassifier()
In [64]:
rf_score = classifier_rf.score(X_test,y_test)
rf_score 
Out[64]:
0.8116883116883117
In [65]:
y_pred_rf = classifier_rf.predict(X_test)
y_pred_rf
Out[65]:
array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
      dtype=int64)

Confusion Matrix¶

In [66]:
conf_matrix = confusion_matrix(y_test, y_pred_rf)
print(conf_matrix)
[[86 11]
 [18 39]]
In [68]:
import seaborn as sns
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", 
            xticklabels=classes, yticklabels=classes)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
In [69]:
from sklearn.metrics import  classification_report
# Print a classification report that includes precision, recall, and F1-score
print("Classification Report:")
print(classification_report(y_test, y_pred_rf, target_names=classes))
Classification Report:
              precision    recall  f1-score   support

 No Diabetes       0.83      0.89      0.86        97
    Diabetes       0.78      0.68      0.73        57

    accuracy                           0.81       154
   macro avg       0.80      0.79      0.79       154
weighted avg       0.81      0.81      0.81       154
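The report values can be checked by hand from the confusion matrix printed above (rows = actual, columns = predicted); for the positive (diabetic) class:

```python
# Confusion matrix printed above: [[86 11], [18 39]]
tn, fp, fn, tp = 86, 11, 18, 39

precision = tp / (tp + fp)                          # 39/50 = 0.78
recall = tp / (tp + fn)                             # 39/57 ≈ 0.68
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.78 0.68 0.73
```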

K-fold cross validation¶

In [70]:
from sklearn.model_selection import cross_val_score, KFold 
# Create a k-fold cross-validator
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(classifier_rf, X_train_sc, y_train, cv=kf)
print("Cross-validation scores:", scores)
classifier_rf_kfold_mean_score = np.mean(scores)
print("Mean Accuracy:",classifier_rf_kfold_mean_score )
Cross-validation scores: [0.78861789 0.77235772 0.73170732 0.74796748 0.71311475]
Mean Accuracy: 0.7507530321204852
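Because the dataset is imbalanced, StratifiedKFold (which preserves the class ratio in every fold) is a common alternative to the plain KFold used above; a self-contained sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced problem (~65% negative, like the diabetes data)
X, y = make_classification(n_samples=300, weights=[0.65], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=skf)
print(len(scores))  # 5
```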

Logistic Regression¶

In [71]:
from sklearn.linear_model import LogisticRegression
classifier=LogisticRegression(solver='liblinear')
classifier.fit(X_train,y_train)   
classifier.score(X_test,y_test)
Out[71]:
0.8051948051948052
In [72]:
y_test_prediction=classifier.predict(X_test)
y_test_prediction
Out[72]:
array([0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
      dtype=int64)

Confusion Matrix¶

In [73]:
conf_mat=confusion_matrix(y_test,y_test_prediction)
print(conf_mat)
[[92  5]
 [25 32]]
In [74]:
plt.figure(figsize=(12,6))
sns.heatmap(conf_mat,annot=True,fmt='d')
plt.title("Confusion Matrix of test data")
plt.xlabel("Predicted value")
plt.ylabel("Actual value")
Out[74]:
Text(120.72222222222221, 0.5, 'Actual value')
In [75]:
print(classification_report(y_test,y_test_prediction))
              precision    recall  f1-score   support

           0       0.79      0.95      0.86        97
           1       0.86      0.56      0.68        57

    accuracy                           0.81       154
   macro avg       0.83      0.75      0.77       154
weighted avg       0.82      0.81      0.79       154

K-fold cross validation¶

In [76]:
from sklearn.model_selection import cross_val_score, KFold
# Create a k-fold cross-validator
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(classifier, X_train_sc, y_train, cv=kf)

print("Cross-validation scores:", scores)
logistic_kfold_mean_score = np.mean(scores)
print("Mean Accuracy:",logistic_kfold_mean_score )
Cross-validation scores: [0.73170732 0.80487805 0.7804878  0.74796748 0.79508197]
Mean Accuracy: 0.7720245235239238

Support Vector Machine¶

In [77]:
from sklearn.svm import SVC
svm_model = SVC(kernel='linear', C=1)
svm_model.fit(X_train_sc, y_train)
svm_model.score(X_test_sc,y_test)
Out[77]:
0.7857142857142857
In [78]:
y_pred = svm_model.predict(X_test_sc)
y_pred
Out[78]:
array([0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
      dtype=int64)

Confusion Matrix¶

In [79]:
conf_matrix = confusion_matrix(y_test, y_pred)
In [80]:
import seaborn as sns
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=classes, yticklabels=classes)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
In [81]:
from sklearn.metrics import  classification_report
# Print a classification report that includes precision, recall, and F1-score
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=classes))
Classification Report:
              precision    recall  f1-score   support

 No Diabetes       0.78      0.92      0.84        97
    Diabetes       0.80      0.56      0.66        57

    accuracy                           0.79       154
   macro avg       0.79      0.74      0.75       154
weighted avg       0.79      0.79      0.78       154

K-fold cross validation¶

In [82]:
from sklearn.model_selection import cross_val_score, KFold
# Create a k-fold cross-validator
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Perform k-fold cross-validation for linear SVM
scores = cross_val_score(svm_model, X_train_sc, y_train, cv=kf)
print("Cross-validation scores:", scores)
svm_kfold_mean_score = np.mean(scores)
print("Mean Accuracy:",svm_kfold_mean_score )
Cross-validation scores: [0.7398374  0.81300813 0.76422764 0.73170732 0.77868852]
Mean Accuracy: 0.7654938024790084

K nearest neighbour¶

In [83]:
from sklearn.neighbors import KNeighborsClassifier
In [84]:
classifier_knn = KNeighborsClassifier(n_neighbors=5)
classifier_knn.fit(X_train,y_train)
Out[84]:
KNeighborsClassifier()
In [85]:
knn_score = classifier_knn.score(X_test,y_test)
knn_score
Out[85]:
0.7272727272727273
In [86]:
y_pred_knn = classifier_knn.predict(X_test)
y_pred_knn
Out[86]:
array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0,
       0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
      dtype=int64)

Confusion Matrix¶

In [87]:
conf_matrix = confusion_matrix(y_test, y_pred_knn)
In [88]:
import seaborn as sns
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=classes, yticklabels=classes)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
In [89]:
from sklearn.metrics import  classification_report
# Print a classification report that includes precision, recall, and F1-score
print("Classification Report:")
print(classification_report(y_test, y_pred_knn, target_names=classes))
Classification Report:
              precision    recall  f1-score   support

 No Diabetes       0.76      0.84      0.79        97
    Diabetes       0.66      0.54      0.60        57

    accuracy                           0.73       154
   macro avg       0.71      0.69      0.70       154
weighted avg       0.72      0.73      0.72       154

K-fold cross validation¶

In [90]:
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(classifier_knn, X_train, y_train, cv=kf)
print("Cross-validation scores:", scores)
classifier_knn_kfold_mean_score = np.mean(scores)
print("Mean Accuracy:",classifier_knn_kfold_mean_score )
Cross-validation scores: [0.73170732 0.69918699 0.69105691 0.69918699 0.75409836]
Mean Accuracy: 0.7150473144075703

XGBoost¶

In [91]:
from xgboost import XGBClassifier
xgb_model = XGBClassifier(gamma=0)
xgb_model.fit(X_train, y_train)
Out[91]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=0, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)
In [92]:
from sklearn import metrics
xgb_pred = xgb_model.predict(X_test)
print("Accuracy Score =", metrics.accuracy_score(y_test, xgb_pred))
Accuracy Score = 0.7792207792207793

Data Visualization¶

In [93]:
def visualize_data(X, y, preds, title):
    plt.figure(figsize=(10, 6))
    plt.scatter(X[y == 0]['Glucose'], X[y == 0]['BMI'], label='No Diabetes', alpha=0.7)
    plt.scatter(X[y == 1]['Glucose'], X[y == 1]['BMI'], label='Diabetes', alpha=0.7)
    plt.scatter(X[preds == 1]['Glucose'], X[preds == 1]['BMI'], label='Predicted Diabetes', marker='x', c='red')
    plt.title(title)
    plt.xlabel('Glucose')
    plt.ylabel('BMI')
    plt.legend()
    plt.show()
    
visualize_data(X_test, y_test, y_pred_knn, 'K-Nearest Neighbors Predictions')
visualize_data(X_test, y_test, y_pred, 'Support Vector Machine Predictions')
visualize_data(X_test, y_test, y_pred_rf, 'Random Forest Predictions')
visualize_data(X_test, y_test, xgb_pred, 'XGBoost')
visualize_data(X_test, y_test, y_test_prediction, 'Logistic Regression')

Predict Patient Diabetes¶

In [94]:
patient1 =[1,89,66,23,94,28.1,0.167,21]
In [95]:
patient1=np.array([patient1])
patient1
Out[95]:
array([[ 1.   , 89.   , 66.   , 23.   , 94.   , 28.1  ,  0.167, 21.   ]])
In [96]:
pred=classifier.predict(patient1)
if pred[0]==1:
    print('Patient is diabetic')
else:
    print('Patient is not diabetic')
Patient is not diabetic
C:\Users\admin\anaconda3\lib\site-packages\sklearn\base.py:420: UserWarning: X does not have valid feature names, but LogisticRegression was fitted with feature names
  warnings.warn(
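The warning appears because the model was fitted on a DataFrame (with column names) but predict received a bare NumPy array. Passing a one-row DataFrame with the same columns avoids it; a self-contained sketch using a hypothetical two-row training set taken from the rows shown earlier:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
        'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# Tiny stand-in training set (two rows shown earlier in the dataset)
X = pd.DataFrame([[6, 148, 72, 35, 0, 33.6, 0.627, 50],
                  [1, 85, 66, 29, 0, 26.6, 0.351, 31]], columns=cols)
y = [1, 0]
clf = LogisticRegression(solver='liblinear').fit(X, y)

# A one-row DataFrame keeps the feature names, so no UserWarning is raised
patient1 = pd.DataFrame([[1, 89, 66, 23, 94, 28.1, 0.167, 21]], columns=cols)
pred = clf.predict(patient1)
```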

Conclusion¶

Using these patient records, we were able to build machine learning models that predict whether or not a patient in the dataset has diabetes. Of the models tried, random forest performed best, followed closely by logistic regression.
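For reference, the test-set accuracies reported in the cells above can be collected into a single comparison:

```python
# Test-set accuracies reported in the cells above
scores = {
    'Random Forest':       0.8117,
    'Logistic Regression': 0.8052,
    'SVM (linear)':        0.7857,
    'XGBoost':             0.7792,
    'KNN':                 0.7273,
}
best = max(scores, key=scores.get)
print(best)  # Random Forest
```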
